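To achieve promising results on blind image super-resolution (SR), some attempts leverage the low-resolution (LR) image to predict the blur kernel and improve SR performance. However, these supervised kernel prediction (SKP) methods are impractical because real-world blur kernels are unavailable. Although some unsupervised degradation prediction (UDP) methods have been proposed to bypass this problem, the inconsistency between the degradation embedding and the SR feature remains challenging. By exploring the correlation between the degradation embedding and the SR feature, we observe that jointly learning content- and degradation-aware features is optimal. Based on this observation, we propose a content- and degradation-aware SR network named CDSR. Specifically, CDSR contains three newly established modules: (1) a lightweight patch-based encoder (LPE) that jointly extracts content and degradation features; (2) a domain query attention-based module (DQA) that adaptively reduces the inconsistency; and (3) a codebook-based space compress module (CSC) that suppresses redundant information. Extensive experiments on several benchmarks show that the proposed CDSR outperforms existing UDP models and achieves competitive PSNR and SSIM even compared with state-of-the-art SKP methods.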
translated by 谷歌翻译
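Owing to its powerful feature learning capability and high efficiency, deep hashing has achieved great success in large-scale image retrieval. Meanwhile, extensive work has shown that deep neural networks (DNNs) are susceptible to adversarial examples, and adversarial attacks against deep hashing have attracted considerable research effort. However, backdoor attacks, another well-known threat to DNNs, have not yet been studied for deep hashing. Although various backdoor attacks have been proposed for image classification, existing approaches fail to realize a truly imperceptible backdoor attack that combines invisible triggers with a clean-label setting, and they also cannot meet the intrinsic demands of an image retrieval backdoor. In this paper, we propose Badhash, the first generative-based imperceptible backdoor attack against deep hashing, which can effectively generate invisible and input-specific poisoned images with clean labels. Specifically, we first propose a new conditional generative adversarial network (CGAN) pipeline to generate poisoned samples effectively. For any given benign image, it seeks to generate a natural-looking poisoned counterpart with a unique invisible trigger. To improve attack effectiveness, we introduce a label-based contrastive learning network, LabCln, to exploit the semantic characteristics of different labels, which are subsequently used to confuse and mislead the target model into learning the embedded trigger. Finally, we explore the mechanism of backdoor attacks on image retrieval in the hash space. Extensive experiments on multiple benchmark datasets verify that Badhash can generate imperceptible poisoned samples with strong attack ability and transferability to state-of-the-art deep hashing schemes. Primary Subject Area: [Engagement] Multimedia Search and Recommendation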
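Although deep neural network (DNN) models exhibit outstanding performance in various applications, their large model size and extensive floating-point operations make deployment on mobile computing platforms a major challenge, particularly on Internet of Things (IoT) devices. One appealing solution is model quantization, which reduces the model size and uses integer operations commonly supported by microcontrollers. To this end, a 1-bit quantized DNN model, or deep binary neural network (BNN), maximizes memory efficiency, since each parameter in a BNN model occupies only 1 bit. In this paper, we propose a reconfigurable BNN (RBNN) to further amplify memory efficiency for resource-constrained IoT devices. In general, the RBNN can be reconfigured on demand to perform any one of M (M > 1) distinct tasks with the same parameter set, so only a single task determines the memory requirement. In other words, memory utilization improves by a factor of M. Our extensive experiments confirm that up to seven commonly used tasks can coexist (the value of M can be larger). These tasks, which have varying numbers of classes, show no or negligible accuracy drop on three popular binarized DNN architectures, including VGG, ResNet, and ReactNet. The tasks span different domains, such as the computer vision and audio domains validated in this paper, with the prerequisite that the model architecture can serve these cross-domain tasks. To protect the intellectual property of an RBNN model, reconfiguration can be controlled by both a user key and a device-unique root key generated from an intrinsic hardware fingerprint. As a result, an RBNN model can be used only per paid user per authorized device, benefiting both the user and the model provider.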
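The storage argument behind 1-bit quantization can be illustrated with a small NumPy sketch. It assumes sign-based binarization, a common choice for BNNs that the abstract does not confirm, and the helper names `binarize_and_pack` and `unpack` are hypothetical. It shows only the memory saving relative to float32 storage, not the RBNN reconfiguration or key mechanism.

```python
import numpy as np

def binarize_and_pack(weights: np.ndarray) -> np.ndarray:
    """Binarize float weights by sign and pack them 8 per byte (assumed scheme)."""
    bits = (weights.ravel() >= 0).astype(np.uint8)  # 1 encodes +1, 0 encodes -1
    return np.packbits(bits)

def unpack(packed: np.ndarray, shape: tuple) -> np.ndarray:
    """Recover the {-1, +1} weight tensor from its packed 1-bit form."""
    n = int(np.prod(shape))
    bits = np.unpackbits(packed)[:n]  # drop any padding bits
    return (bits.astype(np.int8) * 2 - 1).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)  # toy float32 layer

packed = binarize_and_pack(w)
restored = unpack(packed, w.shape)

# float32 uses 32 bits per parameter, the packed form uses 1 bit per parameter
compression = w.nbytes / packed.nbytes
print(w.nbytes, packed.nbytes, compression)
```

In this toy layer the packed weights take 1/32 of the float32 storage; sharing one such parameter set across M tasks, as the abstract describes, would multiply the effective saving by M.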
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
Through a study of multi-gas mixture datasets, we show that in multi-component spectral analysis, the number of functional or non-functional principal components required to retain the essential information equals the number of independent constituents in the mixture set. Because the different gas molecules are mutually independent, a near one-to-one projection from the principal components to the mixture constituents can be established, which significantly simplifies spectral quantification. Further, given the molar extinction coefficients of each constituent, a complete principal component set can be extracted directly from the coefficients, and few or no training samples are required for the learning model. Compared with other approaches, the proposed methods provide fast and accurate spectral quantification with a small memory footprint.
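The claim that the number of required components matches the number of independent constituents can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' procedure: it assumes a noise-free linear (Beer-Lambert) mixing model with unit path length, and the Gaussian absorption bands, the `band` helper, and all numeric values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 200)  # normalized spectral axis (hypothetical)

def band(center: float, width: float) -> np.ndarray:
    """Hypothetical Gaussian absorption band standing in for a real spectrum."""
    return np.exp(-0.5 * ((axis - center) / width) ** 2)

# Assumed molar extinction spectra, one row per gas constituent
E = np.stack([band(0.25, 0.05), band(0.50, 0.08), band(0.70, 0.04)])

# 50 mixtures with random concentrations of the three gases
C = rng.uniform(0.1, 1.0, size=(50, 3))

# Linear mixing: absorbance = concentrations x extinction coefficients
A = C @ E  # shape (50, 200)

# Count the singular values that carry information
s = np.linalg.svd(A, compute_uv=False)
n_components = int(np.sum(s > 1e-8 * s[0]))

# With known extinction coefficients, concentrations follow by least squares,
# without any training samples
C_hat = A @ np.linalg.pinv(E)

print(n_components, np.allclose(C_hat, C))
```

In this noise-free setting the spectra matrix has exactly as many significant singular values as constituents. With measurement noise the cutoff becomes a judgment call, and real spectra may deviate from the linear model.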
Body Mass Index (BMI), age, height, and weight are important indicators of human health and provide useful information for many practical purposes, such as health care, monitoring, and re-identification. Most existing methods for health indicator prediction use front-view body or face images. These inputs are hard to obtain in daily life, and their strict requirements on view and pose often make the models less robust. In this paper, we propose to predict health indicators from gait videos, which are more prevalent in surveillance and home monitoring scenarios. However, the study of health indicator prediction from gait videos using deep learning has been hindered by the small amount of open-source data. To address this issue, we analyse the similarity and relationship between the pose estimation and health indicator prediction tasks, and then propose a paradigm that enables deep learning on small health indicator datasets by pre-training on the pose estimation task. Furthermore, to better suit the health indicator prediction task, we introduce the Global-Local Aware aNd Centrosymmetric Encoder (GLANCE) module. It first extracts local and global features by progressive convolutions and then fuses multi-level features by a centrosymmetric double-path hourglass structure in two different ways. Experiments demonstrate that the proposed paradigm achieves state-of-the-art results for predicting health indicators on MoVi, and that the GLANCE module is also beneficial for pose estimation on 3DPW.
Text classification, a core component of task-oriented dialogue systems, attracts continuous attention from both the research and industry communities and has seen tremendous progress. However, existing methods do not consider the use of label information, which may weaken the performance of text classification systems in some token-aware scenarios. To address this problem, we introduce label information in the form of label embeddings for text classification and achieve remarkable performance on a benchmark dataset.
Recent research has reported a performance degradation in self-supervised contrastive learning for specially designed efficient networks, such as MobileNet and EfficientNet. A common practice to address this problem is to introduce a pretrained contrastive teacher model and train the lightweight networks with distillation signals generated by the teacher. However, pretraining a teacher model is time- and resource-consuming when one is not already available. In this work, we aim to establish a stronger baseline for lightweight contrastive models without using a pretrained teacher model. Specifically, we show that the optimal recipe for efficient models differs from that of larger models, and that using the same training settings as ResNet50, as previous research does, is inappropriate. Additionally, we observe a common issue in contrastive learning where either the positive or negative views can be noisy, and propose a smoothed version of the InfoNCE loss to alleviate this problem. As a result, we improve the linear evaluation results from 36.3\% to 62.3\% for MobileNet-V3-Large and from 42.2\% to 65.8\% for EfficientNet-B0 on ImageNet, closing the accuracy gap to ResNet50 with $5\times$ fewer parameters. We hope our research will facilitate the usage of lightweight contrastive models.
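The abstract does not state the form of the smoothed InfoNCE loss. One plausible reading, assumed here purely for illustration, is label smoothing applied to the InfoNCE targets, so that a small probability mass `alpha` is spread over the negatives in case the positive or some negatives are noisy. The function name `smoothed_info_nce` and the values of `temperature` and `alpha` are hypothetical.

```python
import numpy as np

def smoothed_info_nce(query, keys, positive_idx, temperature=0.2, alpha=0.1):
    """Label-smoothed InfoNCE (assumed form); alpha=0 gives standard InfoNCE.

    query: (d,) L2-normalized embedding; keys: (n, d) L2-normalized embeddings,
    one of which (positive_idx) is the positive view.
    """
    logits = keys @ query / temperature
    logits = logits - logits.max()                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    n = keys.shape[0]
    target = np.full(n, alpha / (n - 1))            # small mass on each negative
    target[positive_idx] = 1.0 - alpha              # most mass on the positive
    return float(-(target * log_probs).sum())

rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 16))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# The query is a slightly perturbed copy of keys[0], so index 0 is the positive
query = keys[0] + 0.1 * rng.standard_normal(16)
query /= np.linalg.norm(query)

plain = smoothed_info_nce(query, keys, positive_idx=0, alpha=0.0)
smooth = smoothed_info_nce(query, keys, positive_idx=0, alpha=0.1)
print(plain, smooth)
```

With `alpha=0` the target is one-hot and the function reduces to the usual cross-entropy form of InfoNCE; a positive `alpha` softens the target so that a single mislabeled view contributes less extreme gradients.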
Estimating the probability of failure for complex real-world systems using high-fidelity computational models is often prohibitively expensive, especially when the probability is small. Exploiting low-fidelity models can make this process more feasible, but merging information from multiple low-fidelity and high-fidelity models poses several challenges. This paper presents a robust multi-fidelity surrogate modeling strategy in which the surrogate is assembled through active learning, with an on-the-fly model adequacy assessment set within a subset simulation framework for efficient reliability analysis. The multi-fidelity surrogate is assembled by first applying a Gaussian process correction to each low-fidelity model and assigning a model probability based on the model's local predictive accuracy and cost. Three strategies are proposed to fuse these individual surrogates into an overall surrogate model based on model averaging and deterministic/stochastic model selection. The strategies also dictate which model evaluations are necessary. No assumptions are made about the relationships between low-fidelity models, while the high-fidelity model is assumed to be the most accurate and most computationally expensive model. Through two analytical and two numerical case studies, including a case study evaluating the failure probability of Tristructural isotropic-coated (TRISO) nuclear fuels, the algorithm is shown to be highly accurate while drastically reducing the number of high-fidelity model calls (and hence computational cost).
In natural language processing (NLP), the context of a word or sentence plays an essential role. Contextual information, such as the semantic representation of a passage or the dialogue history, forms an essential part of a conversation and of a precise understanding of the present phrase or sentence. However, despite their great success in modeling sequence alignment, standard attention mechanisms typically generate weights from the query and key alone and ignore context, forming a Bi-Attention framework. This Bi-Attention mechanism does not explicitly model the interactions among the contexts, queries, and keys of target sequences, so it misses important contextual information and yields poor attention performance. Accordingly, a novel and general triple-attention (Tri-Attention) framework expands the standard Bi-Attention mechanism and explicitly models the interaction among query, key, and context by incorporating context as the third dimension in calculating relevance scores. Four variants of Tri-Attention are generated by expanding the two-dimensional vector-based additive, dot-product, scaled dot-product, and bilinear operations in Bi-Attention to tensor operations for Tri-Attention. Extensive experiments on three NLP tasks demonstrate that Tri-Attention outperforms about 30 state-of-the-art non-attention, standard Bi-Attention, and contextual Bi-Attention approaches as well as pretrained neural language models.
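To make the idea of context as a third dimension concrete, the sketch below contrasts standard scaled dot-product attention with a simple three-way variant. The exact tensor operations of Tri-Attention are not given in the abstract; the trilinear score used here (an element-wise product of query, key, and a single context vector, summed over the feature dimension) is an assumption for illustration, and `tri_attention` is a hypothetical name.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bi_attention(Q, K, V):
    """Standard scaled dot-product attention: scores depend on query and key."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def tri_attention(Q, K, V, c):
    """Assumed three-way variant: score[i, j] = sum_d Q[i,d] * K[j,d] * c[d] / sqrt(d)."""
    d = Q.shape[-1]
    scores = np.einsum("id,jd,d->ij", Q, K, c) / np.sqrt(d)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 16))   # 4 query positions
K = rng.standard_normal((6, 16))   # 6 key positions
V = rng.standard_normal((6, 8))    # values attached to the keys
c = rng.standard_normal(16)        # one context vector (hypothetical)

out_tri = tri_attention(Q, K, V, c)
out_bi = bi_attention(Q, K, V)
out_neutral = tri_attention(Q, K, V, np.ones(16))  # all-ones context
print(out_tri.shape, np.allclose(out_neutral, out_bi))
```

With an all-ones context vector the three-way score collapses to the ordinary dot product, which shows that this variant strictly generalizes the two-way score; any other context vector reweights the feature dimensions before the softmax.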